Facies classification using Machine Learning

Introduction

Author: Anjum Sayed. My prediction approach is as follows:

"Upsampling" the log data to a more conventional sample rate
Testing different classifiers:
1. SVM (using GridSearchCV to find best parameters)
2. DecisionTree (using GridSearchCV to find best parameters)
3. XGBoost
4. Naive Bayes
5. AdaBoost
6. RandomForest
7. Nearest neighbour (using GridSearchCV to find best parameters)
8. DNN TensorFlow classifier
9. LSTM TensorFlow classifier (TODO)
Taking the 5 best classifiers and creating an ensemble majority vote classifier

There are many ways to improve on my method. See the future work at the end section for ideas.

Exploring the dataset

First, we will examine the data set we will use to train the classifier. The training data is contained in the file facies_vectors.csv. The dataset consists of 5 wireline log measurements, two indicator variables and a facies label at half foot intervals. In machine learning terminology, each log measurement is a feature vector that maps a set of 'features' (the log measurements) to a class (the facies type). We will use the pandas library to load the data into a dataframe, which provides a convenient data structure to work with well log data.



In [1]:

    
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.colors as colors
from mpl_toolkits.axes_grid1 import make_axes_locatable

from pandas import set_option
set_option("display.max_rows", 10)
pd.options.mode.chained_assignment = None

filename = 'facies_vectors.csv'
training_data = pd.read_csv(filename)
training_data









    Out[1]:






  
    
      
      Facies
      Formation
      Well Name
      Depth
      GR
      ILD_log10
      DeltaPHI
      PHIND
      PE
      NM_M
      RELPOS
    
  
  
    
      0
      3
      A1 SH
      SHRIMPLIN
      2793.0
      77.450
      0.664
      9.900
      11.915
      4.600
      1
      1.000
    
    
      1
      3
      A1 SH
      SHRIMPLIN
      2793.5
      78.260
      0.661
      14.200
      12.565
      4.100
      1
      0.979
    
    
      2
      3
      A1 SH
      SHRIMPLIN
      2794.0
      79.050
      0.658
      14.800
      13.050
      3.600
      1
      0.957
    
    
      3
      3
      A1 SH
      SHRIMPLIN
      2794.5
      86.100
      0.655
      13.900
      13.115
      3.500
      1
      0.936
    
    
      4
      3
      A1 SH
      SHRIMPLIN
      2795.0
      74.580
      0.647
      13.500
      13.300
      3.400
      1
      0.915
    
    
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
    
    
      4144
      5
      C LM
      CHURCHMAN BIBLE
      3120.5
      46.719
      0.947
      1.828
      7.254
      3.617
      2
      0.685
    
    
      4145
      5
      C LM
      CHURCHMAN BIBLE
      3121.0
      44.563
      0.953
      2.241
      8.013
      3.344
      2
      0.677
    
    
      4146
      5
      C LM
      CHURCHMAN BIBLE
      3121.5
      49.719
      0.964
      2.925
      8.013
      3.190
      2
      0.669
    
    
      4147
      5
      C LM
      CHURCHMAN BIBLE
      3122.0
      51.469
      0.965
      3.083
      7.708
      3.152
      2
      0.661
    
    
      4148
      5
      C LM
      CHURCHMAN BIBLE
      3122.5
      50.031
      0.970
      2.609
      6.668
      3.295
      2
      0.653
    
  

4149 rows × 11 columns

This data is from the Council Grove gas reservoir in Southwest Kansas. The Panoma Council Grove Field is predominantly a carbonate gas reservoir encompassing 2700 square miles in Southwestern Kansas. This dataset is from nine wells (with 4149 examples), consisting of a set of seven predictor variables and a rock facies (class) for each example vector and validation (test) data (830 examples from two wells) having the same seven predictor variables in the feature vector. Facies are based on examination of cores from nine wells taken vertically at half-foot intervals. Predictor variables include five from wireline log measurements and two geologic constraining variables that are derived from geologic knowledge. These are essentially continuous variables sampled at a half-foot sample rate.

The seven predictor variables are:

Five wire line log curves include gamma ray (GR), resistivity logging (ILD_log10), photoelectric effect (PE), neutron-density porosity difference and average neutron-density porosity (DeltaPHI and PHIND). Note, some wells do not have PE.
Two geologic constraining variables: nonmarine-marine indicator (NM_M) and relative position (RELPOS)

The nine discrete facies (classes of rocks) are:

Nonmarine sandstone
Nonmarine coarse siltstone
Nonmarine fine siltstone
Marine siltstone and shale
Mudstone (limestone)
Wackestone (limestone)
Dolomite
Packstone-grainstone (limestone)
Phylloid-algal bafflestone (limestone)

These facies aren't discrete, and gradually blend into one another. Some have neighboring facies that are rather close. Mislabeling within these neighboring facies can be expected to occur. The following table lists the facies, their abbreviated labels and their approximate neighbors.

Facies	Label	Adjacent Facies
1	SS	2
2	CSiS	1,3
3	FSiS	2
4	SiSh	5
5	MS	4,6
6	WS	5,7
7	D	6,8
8	PS	6,7,9
9	BS	7,8

Let's clean up this dataset. The 'Well Name' and 'Formation' columns can be turned into a categorical data type.



In [2]:

    
training_data['Well Name'] = training_data['Well Name'].astype('category')
training_data['Formation'] = training_data['Formation'].astype('category')
training_data['Well Name'].unique()









    Out[2]:





[SHRIMPLIN, ALEXANDER D, SHANKLE, LUKE G U, KIMZEY A, CROSS H CATTLE, NOLAN, Recruit F9, NEWBY, CHURCHMAN BIBLE]
Categories (10, object): [SHRIMPLIN, ALEXANDER D, SHANKLE, LUKE G U, ..., NOLAN, Recruit F9, NEWBY, CHURCHMAN BIBLE]



In [3]:

    
# Drop the rows with missing PEF values
training_data.dropna(inplace=True)



In [4]:

    
training_data.describe()









    Out[4]:






  
    
      
      Facies
      Depth
      GR
      ILD_log10
      DeltaPHI
      PHIND
      PE
      NM_M
      RELPOS
    
  
  
    
      count
      3232.000000
      3232.000000
      3232.000000
      3232.000000
      3232.000000
      3232.000000
      3232.000000
      3232.000000
      3232.000000
    
    
      mean
      4.422030
      2875.824567
      66.135769
      0.642719
      3.559642
      13.483213
      3.725014
      1.498453
      0.520287
    
    
      std
      2.504243
      131.006274
      30.854826
      0.241845
      5.228948
      7.698980
      0.896152
      0.500075
      0.286792
    
    
      min
      1.000000
      2573.500000
      13.250000
      -0.025949
      -21.832000
      0.550000
      0.200000
      1.000000
      0.010000
    
    
      25%
      2.000000
      2791.000000
      46.918750
      0.492750
      1.163750
      8.346750
      3.100000
      1.000000
      0.273000
    
    
      50%
      4.000000
      2893.500000
      65.721500
      0.624437
      3.500000
      12.150000
      3.551500
      1.000000
      0.526000
    
    
      75%
      6.000000
      2980.000000
      79.626250
      0.812735
      6.432500
      16.453750
      4.300000
      2.000000
      0.767250
    
    
      max
      9.000000
      3122.500000
      361.150000
      1.480000
      18.600000
      84.400000
      8.094000
      2.000000
      1.000000

This is a quick view of the statistical distribution of the input variables. Looking at the count values, there are 3232 feature vectors in the training set.

These are the names of the 10 training wells in the Council Grove reservoir. Data has been recruited into pseudo-well 'Recruit F9' to better represent facies 9, the Phylloid-algal bafflestone.

Before we plot the well data, let's define a color map so the facies are represented by consistent color in all the plots in this tutorial. We also create the abbreviated facies labels, and add those to the facies_vectors dataframe.



In [5]:

    
# 1=sandstone  2=c_siltstone   3=f_siltstone 
# 4=marine_silt_shale 5=mudstone 6=wackestone 7=dolomite
# 8=packstone 9=bafflestone
facies_colors = ['#F4D03F', '#F5B041','#DC7633','#6E2C00', '#1B4F72','#2E86C1', '#AED6F1', '#A569BD', '#196F3D']
facies_labels = ['SS', 'CSiS', 'FSiS', 'SiSh', 'MS', 'WS', 'D','PS', 'BS']
#facies_color_map is a dictionary that maps facies labels
#to their respective colors
facies_color_map = {}
for ind, label in enumerate(facies_labels):
    facies_color_map[label] = facies_colors[ind]

def label_facies(row, labels):
    return labels[ row['Facies'] -1]
    
training_data.loc[:,'FaciesLabels'] = training_data.apply(lambda row: label_facies(row, facies_labels), axis=1)

Let's take a look at the data from individual wells in a more familiar log plot form. We will create plots for the five well log variables, as well as a log for facies labels. The plots are based on the those described in Alessandro Amato del Monte's excellent tutorial.



In [6]:

    
def make_facies_log_plot(logs, facies_colors):
    #make sure logs are sorted by depth
    logs = logs.sort_values(by='Depth')
    cmap_facies = colors.ListedColormap(
            facies_colors[0:len(facies_colors)], 'indexed')
    
    ztop=logs.Depth.min(); zbot=logs.Depth.max()
    
    cluster=np.repeat(np.expand_dims(logs['Facies'].values,1), 100, 1)
    
    f, ax = plt.subplots(nrows=1, ncols=6, figsize=(8, 12))
    ax[0].plot(logs.GR, logs.Depth, '-g')
    ax[1].plot(logs.ILD_log10, logs.Depth, '-')
    ax[2].plot(logs.DeltaPHI, logs.Depth, '-', color='0.5')
    ax[3].plot(logs.PHIND, logs.Depth, '-', color='r')
    ax[4].plot(logs.PE, logs.Depth, '-', color='black')
    im=ax[5].imshow(cluster, interpolation='none', aspect='auto',
                    cmap=cmap_facies,vmin=1,vmax=9)
    
    divider = make_axes_locatable(ax[5])
    cax = divider.append_axes("right", size="20%", pad=0.05)
    cbar=plt.colorbar(im, cax=cax)
    cbar.set_label((17*' ').join([' SS ', 'CSiS', 'FSiS', 'SiSh', ' MS ', ' WS ', ' D  ', ' PS ', ' BS ']))
    cbar.set_ticks(range(0,1)); cbar.set_ticklabels('')
    
    for i in range(len(ax)-1):
        ax[i].set_ylim(ztop,zbot)
        ax[i].invert_yaxis()
        ax[i].grid()
        ax[i].locator_params(axis='x', nbins=3)
    
    ax[0].set_xlabel("GR")
    ax[0].set_xlim(logs.GR.min(),logs.GR.max())
    ax[1].set_xlabel("ILD_log10")
    ax[1].set_xlim(logs.ILD_log10.min(),logs.ILD_log10.max())
    ax[2].set_xlabel("DeltaPHI")
    ax[2].set_xlim(logs.DeltaPHI.min(),logs.DeltaPHI.max())
    ax[3].set_xlabel("PHIND")
    ax[3].set_xlim(logs.PHIND.min(),logs.PHIND.max())
    ax[4].set_xlabel("PE")
    ax[4].set_xlim(logs.PE.min(),logs.PE.max())
    ax[5].set_xlabel('Facies')
    
    ax[1].set_yticklabels([]); ax[2].set_yticklabels([]); ax[3].set_yticklabels([])
    ax[4].set_yticklabels([]); ax[5].set_yticklabels([])
    ax[5].set_xticklabels([])
    f.suptitle('Well: %s'%logs.iloc[0]['Well Name'], fontsize=14,y=0.94)

Placing the log plotting code in a function will make it easy to plot the logs from multiples wells, and can be reused later to view the results when we apply the facies classification model to other wells. The function was written to take a list of colors and facies labels as parameters.

We then show log plots for wells SHRIMPLIN.



In [7]:

    
make_facies_log_plot(training_data[training_data['Well Name'] == 'SHRIMPLIN'], facies_colors)

In addition to individual wells, we can look at how the various facies are represented by the entire training set. Let's plot a histogram of the number of training examples for each facies class.



In [8]:

    
#count the number of unique entries for each facies, sort them by
#facies number (instead of by number of entries)
facies_counts = training_data['Facies'].value_counts().sort_index()
#use facies labels to index each count
facies_counts.index = facies_labels

facies_counts.plot(kind='bar',color=facies_colors, title='Distribution of Training Data by Facies')
facies_counts









    Out[8]:





SS      259
CSiS    738
FSiS    615
SiSh    184
MS      217
WS      462
D        98
PS      498
BS      161
Name: Facies, dtype: int64

This shows the distribution of examples by facies for the examples in the training set. Dolomite (facies 7) has the fewest with 81 examples. Depending on the performance of the classifier we are going to train, we may consider getting more examples of these facies.

Crossplots are a familiar tool in the geosciences to visualize how two properties vary with rock type. This dataset contains 5 log variables, and scatter matrix can help to quickly visualize the variation between the all the variables in the dataset. We can employ the very useful Seaborn library to quickly create a nice looking scatter matrix. Each pane in the plot shows the relationship between two of the variables on the x and y axis, with each point is colored according to its facies. The same colormap is used to represent the 9 facies.



In [9]:

    
#save plot display settings to change back to when done plotting with seaborn
inline_rc = dict(mpl.rcParams)

import seaborn as sns
sns.set()
sns.pairplot(training_data.drop(['Well Name','Facies','Formation','Depth','NM_M','RELPOS'],axis=1),
             hue='FaciesLabels', palette=facies_color_map,
             hue_order=list(reversed(facies_labels)))

#switch back to default matplotlib plot style
mpl.rcParams.update(inline_rc)

"Upsampling" the dataset

The supplied training data has a sampling rate of 0.5m, which is lower than the industry standard of 0.1524m. This means the the number of observations is a little on the small side, meaning that many ML classifiers will always perform poorly, especially with high entropy datasets.

One workaround to this will be to increase the sampling rate to 0.1m, by using a cubic spline to fill in the gaps in the data. Making up data is generally a no-no, but since wireline logs are generally heavily smoothed by the vendors, this additional step shouldn't add too much error, but will give us 5x more data to play with. We'll do this for each individual well (rather than the whole dataset) to prevent interpolation between wells.



In [10]:

    
upsampled_data = pd.DataFrame()
for well in training_data['Well Name'].unique():
    df = training_data[training_data['Well Name'] == well]
    df.index = np.arange(0, 5*len(df), 5)
    upsampled_df = pd.DataFrame(index=np.arange(0, 5*len(df)))
    upsampled_df = upsampled_df.join(df)
    upsampled_df.interpolate(method='cubic', limit=4, inplace=True)
    upsampled_df.fillna(method="pad", limit=4, inplace=True)
    upsampled_df.drop_duplicates(inplace=True)
    if len(upsampled_data) == 0:
        upsampled_data = upsampled_df
    else:
        upsampled_data = upsampled_data.append(upsampled_df, ignore_index=True)

upsampled_data["Facies"] = upsampled_data["Facies"].round()
upsampled_data["Facies"] = upsampled_data["Facies"].astype(int)
upsampled_data["NM_M"] = upsampled_data["NM_M"].round()
upsampled_data["NM_M"] = upsampled_data["NM_M"].astype(int)

# Sometimes a small number of the facies are labelled as 0 or 10 - these need to be removed
upsampled_data = upsampled_data[upsampled_data.Facies != 0]
upsampled_data = upsampled_data[upsampled_data.Facies != 10]
upsampled_data.loc[:,'FaciesLabels'] = upsampled_data.apply(lambda row: label_facies(row, facies_labels), axis=1)



In [11]:

    
upsampled_data









    Out[11]:






  
    
      
      Facies
      Formation
      Well Name
      Depth
      GR
      ILD_log10
      DeltaPHI
      PHIND
      PE
      NM_M
      RELPOS
      FaciesLabels
    
  
  
    
      0
      3
      A1 SH
      SHRIMPLIN
      2793.0
      77.450000
      0.664000
      9.900000
      11.915000
      4.600000
      1
      1.000000
      FSiS
    
    
      1
      3
      A1 SH
      SHRIMPLIN
      2793.1
      79.284515
      0.663545
      11.150359
      12.015714
      4.546545
      1
      0.996012
      FSiS
    
    
      2
      3
      A1 SH
      SHRIMPLIN
      2793.2
      79.937114
      0.662987
      12.190536
      12.139619
      4.460793
      1
      0.991895
      FSiS
    
    
      3
      3
      A1 SH
      SHRIMPLIN
      2793.3
      79.753455
      0.662357
      13.035533
      12.278167
      4.351767
      1
      0.987673
      FSiS
    
    
      4
      3
      A1 SH
      SHRIMPLIN
      2793.4
      79.079198
      0.661685
      13.700354
      12.422809
      4.228495
      1
      0.983367
      FSiS
    
    
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
    
    
      16122
      5
      C LM
      CHURCHMAN BIBLE
      3122.1
      51.177622
      0.964821
      3.026104
      7.597464
      3.161609
      2
      0.659412
      MS
    
    
      16123
      5
      C LM
      CHURCHMAN BIBLE
      3122.2
      50.791689
      0.964974
      2.945037
      7.451181
      3.179267
      2
      0.657823
      MS
    
    
      16124
      5
      C LM
      CHURCHMAN BIBLE
      3122.3
      50.407644
      0.965716
      2.845018
      7.256165
      3.206420
      2
      0.656228
      MS
    
    
      16125
      5
      C LM
      CHURCHMAN BIBLE
      3122.4
      50.121933
      0.967306
      2.731266
      6.999433
      3.244515
      2
      0.654622
      MS
    
    
      16126
      5
      C LM
      CHURCHMAN BIBLE
      3122.5
      50.031000
      0.970000
      2.609000
      6.668000
      3.295000
      2
      0.653000
      MS
    
  

16122 rows × 12 columns

Let's check if the facies distributions still look right



In [12]:

    
upsampled_data.describe()









    Out[12]:






  
    
      
      Facies
      Depth
      GR
      ILD_log10
      DeltaPHI
      PHIND
      PE
      NM_M
      RELPOS
    
  
  
    
      count
      16122.000000
      16122.000000
      16122.000000
      16122.000000
      16122.000000
      16122.000000
      16122.000000
      16122.000000
      16122.000000
    
    
      mean
      4.420233
      2875.779266
      66.132765
      0.642385
      3.562926
      13.488510
      3.724287
      1.498325
      0.520044
    
    
      std
      2.492889
      130.886683
      30.795497
      0.241821
      5.207364
      7.684145
      0.894569
      0.500013
      0.284700
    
    
      min
      1.000000
      2573.500000
      12.759470
      -0.026553
      -21.923002
      -1.112217
      0.058033
      1.000000
      -0.096527
    
    
      25%
      2.000000
      2791.000000
      46.939636
      0.490246
      1.144971
      8.359510
      3.113000
      1.000000
      0.279000
    
    
      50%
      4.000000
      2893.600003
      65.816922
      0.624654
      3.461482
      12.138445
      3.546221
      1.000000
      0.525477
    
    
      75%
      6.000000
      2980.000000
      79.864096
      0.812950
      6.483459
      16.492567
      4.306479
      2.000000
      0.761796
    
    
      max
      9.000000
      3122.500000
      375.701116
      1.487703
      20.032203
      84.606003
      8.094000
      2.000000
      1.201765



In [13]:

    
facies_counts = upsampled_data['Facies'].value_counts().sort_index()
facies_counts.index = facies_labels

facies_counts.plot(kind='bar',color=facies_colors, title='Distribution of Training Data by Facies')
facies_counts









    Out[13]:





SS      1321
CSiS    3650
FSiS    2967
SiSh     968
MS      1149
WS      2261
D        748
PS      2200
BS       858
Name: Facies, dtype: int64

Looks good! We'll now use this upsampled data as our training data



In [14]:

    
training_data = upsampled_data

Conditioning the data set

Now we extract just the feature variables we need to perform the classification. The predictor variables are the five wireline values and two geologic constraining variables. We also get a vector of the facies labels that correspond to each feature vector.



In [15]:

    
correct_facies_labels = training_data['Facies'].values
well_names = training_data['Well Name']

feature_vectors = training_data.drop(['Formation', 'Well Name', 'Depth','Facies','FaciesLabels'], axis=1)
feature_vectors.describe()









    Out[15]:






  
    
      
      GR
      ILD_log10
      DeltaPHI
      PHIND
      PE
      NM_M
      RELPOS
    
  
  
    
      count
      16122.000000
      16122.000000
      16122.000000
      16122.000000
      16122.000000
      16122.000000
      16122.000000
    
    
      mean
      66.132765
      0.642385
      3.562926
      13.488510
      3.724287
      1.498325
      0.520044
    
    
      std
      30.795497
      0.241821
      5.207364
      7.684145
      0.894569
      0.500013
      0.284700
    
    
      min
      12.759470
      -0.026553
      -21.923002
      -1.112217
      0.058033
      1.000000
      -0.096527
    
    
      25%
      46.939636
      0.490246
      1.144971
      8.359510
      3.113000
      1.000000
      0.279000
    
    
      50%
      65.816922
      0.624654
      3.461482
      12.138445
      3.546221
      1.000000
      0.525477
    
    
      75%
      79.864096
      0.812950
      6.483459
      16.492567
      4.306479
      2.000000
      0.761796
    
    
      max
      375.701116
      1.487703
      20.032203
      84.606003
      8.094000
      2.000000
      1.201765

Scikit includes a preprocessing module that can 'standardize' the data (giving each variable zero mean and unit variance, also called whitening). Many machine learning algorithms assume features will be standard normally distributed data (ie: Gaussian with zero mean and unit variance). The factors used to standardize the training set must be applied to any subsequent feature set that will be input to the classifier. The StandardScalar class can be fit to the training set, and later used to standardize any training data.



In [16]:

    
from sklearn import preprocessing

scaler = preprocessing.StandardScaler().fit(feature_vectors)
scaled_features = scaler.transform(feature_vectors)



In [17]:

    
feature_vectors









    Out[17]:






  
    
      
      GR
      ILD_log10
      DeltaPHI
      PHIND
      PE
      NM_M
      RELPOS
    
  
  
    
      0
      77.450000
      0.664000
      9.900000
      11.915000
      4.600000
      1
      1.000000
    
    
      1
      79.284515
      0.663545
      11.150359
      12.015714
      4.546545
      1
      0.996012
    
    
      2
      79.937114
      0.662987
      12.190536
      12.139619
      4.460793
      1
      0.991895
    
    
      3
      79.753455
      0.662357
      13.035533
      12.278167
      4.351767
      1
      0.987673
    
    
      4
      79.079198
      0.661685
      13.700354
      12.422809
      4.228495
      1
      0.983367
    
    
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
    
    
      16122
      51.177622
      0.964821
      3.026104
      7.597464
      3.161609
      2
      0.659412
    
    
      16123
      50.791689
      0.964974
      2.945037
      7.451181
      3.179267
      2
      0.657823
    
    
      16124
      50.407644
      0.965716
      2.845018
      7.256165
      3.206420
      2
      0.656228
    
    
      16125
      50.121933
      0.967306
      2.731266
      6.999433
      3.244515
      2
      0.654622
    
    
      16126
      50.031000
      0.970000
      2.609000
      6.668000
      3.295000
      2
      0.653000
    
  

16122 rows × 7 columns

Scikit also includes a handy function to randomly split the training data into training and test sets. The test set contains a small subset of feature vectors that are not used to train the network. Because we know the true facies labels for these examples, we can compare the results of the classifier to the actual facies and determine the accuracy of the model. Let's use 20% of the data for the test set.



In [18]:

    
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(scaled_features, correct_facies_labels, test_size=0.1, random_state=48)

Training the SVM classifier

Now we use the cleaned and conditioned training set to create a facies classifier. As mentioned above, we will use a type of machine learning model known as a support vector machine. The SVM is a map of the feature vectors as points in a multi dimensional space, mapped so that examples from different facies are divided by a clear gap that is as wide as possible.

The SVM implementation in scikit-learn takes a number of important parameters. First we create a classifier using the default settings.



In [19]:

    
from sklearn.svm import SVC

clf = SVC()
clf.fit(X_train, y_train)
predicted_labels = clf.predict(X_test)

Now we can train the classifier using the training set we created above.

Now that the model has been trained on our data, we can use it to predict the facies of the feature vectors in the test set. Because we know the true facies labels of the vectors in the test set, we can use the results to evaluate the accuracy of the classifier.

We need some metrics to evaluate how good our classifier is doing. A confusion matrix is a table that can be used to describe the performance of a classification model. Scikit-learn allows us to easily create a confusion matrix by supplying the actual and predicted facies labels.

The confusion matrix is simply a 2D array. The entries of confusion matrix C[i][j] are equal to the number of observations predicted to have facies j, but are known to have facies i.

To simplify reading the confusion matrix, a function has been written to display the matrix along with facies labels and various error metrics. See the file classification_utilities.py in this repo for the display_cm() function.



In [20]:

    
from sklearn.metrics import confusion_matrix, f1_score, accuracy_score
from classification_utilities import display_cm, display_adj_cm

conf = confusion_matrix(y_test, predicted_labels)
display_cm(conf, facies_labels, display_metrics=True, hide_zeros=True)









    



     Pred    SS  CSiS  FSiS  SiSh    MS    WS     D    PS    BS Total
     True
       SS    75    55     4                 1                     135
     CSiS    12   282    47           1     1                     343
     FSiS     6    92   189     1           1                     289
     SiSh           3     8    64     3    26     1     1         106
       MS           2     1    11    33    55     3     5         110
       WS                 2     8    11   166     3    38     1   229
        D                 1     3     3    15    35    24     1    82
       PS                 2     5     2    41     7   168     3   228
       BS                                   3     2    19    67    91

Precision  0.81  0.65  0.74  0.70  0.62  0.54  0.69  0.66  0.93  0.68
   Recall  0.56  0.82  0.65  0.60  0.30  0.72  0.43  0.74  0.74  0.67
       F1  0.66  0.73  0.70  0.65  0.40  0.62  0.53  0.70  0.82  0.66

The rows of the confusion matrix correspond to the actual facies labels. The columns correspond to the labels assigned by the classifier. For example, consider the first row. For the feature vectors in the test set that actually have label SS, 23 were correctly indentified as SS, 21 were classified as CSiS and 2 were classified as FSiS.

The entries along the diagonal are the facies that have been correctly classified. Below we define two functions that will give an overall value for how the algorithm is performing. The accuracy is defined as the number of correct classifications divided by the total number of classifications.



In [21]:

    
def accuracy(conf):
    total_correct = 0.
    nb_classes = conf.shape[0]
    for i in np.arange(0,nb_classes):
        total_correct += conf[i][i]
    acc = total_correct/sum(sum(conf))
    return acc

As noted above, the boundaries between the facies classes are not all sharp, and some of them blend into one another. The error within these 'adjacent facies' can also be calculated. We define an array to represent the facies adjacent to each other. For facies label i, adjacent_facies[i] is an array of the adjacent facies labels.



In [22]:

    
adjacent_facies = np.array([[1], [0,2], [1], [4], [3,5], [4,6,7], [5,7], [5,6,8], [6,7]])

def accuracy_adjacent(conf, adjacent_facies):
    nb_classes = conf.shape[0]
    total_correct = 0.
    for i in np.arange(0,nb_classes):
        total_correct += conf[i][i]
        for j in adjacent_facies[i]:
            total_correct += conf[i][j]
    return total_correct / sum(sum(conf))



In [23]:

    
print('Facies classification accuracy = %f' % accuracy(conf))
print('Adjacent facies classification accuracy = %f' % accuracy_adjacent(conf, adjacent_facies))









    



Facies classification accuracy = 0.668940
Adjacent facies classification accuracy = 0.940484

SVM parameter selection

The classifier so far has been built with the default parameters. However, we may be able to get improved classification results with optimal parameter choices.

We will consider two parameters. The parameter C is a regularization factor, and tells the classifier how much we want to avoid misclassifying training examples. A large value of C will try to correctly classify more examples from the training set, but if C is too large it may 'overfit' the data and fail to generalize when classifying new data. If C is too small then the model will not be good at fitting outliers and will have a large error on the training set.

The SVM learning algorithm uses a kernel function to compute the distance between feature vectors. Many kernel functions exist, but in this case we are using the radial basis function rbf kernel (the default). The gamma parameter describes the size of the radial basis functions, which is how far away two vectors in the feature space need to be to be considered close.

We will train a series of classifiers with different values for C and gamma. Two nested loops are used to train a classifier for every possible combination of values in the ranges specified. The classification accuracy is recorded for each combination of parameter values. The results are shown in a series of plots, so the parameter values that give the best classification accuracy on the test set can be selected.

This process is also known as 'cross validation'. Often a separate 'cross validation' dataset will be created in addition to the training and test sets to do model selection. For this tutorial we will just use the train set to choose model parameters within GridSearchCV.



In [24]:

    
from sklearn.model_selection import GridSearchCV

parameters = {'C': [.01, 1, 5, 10, 20, 50, 100, 1000, 5000, 10000], 
              'gamma': [0.0001, 0.001, 0.01, 0.1, 1, 10], 
              'kernel': ['rbf']}  # This could be extended to the linear kernel but it takes a long time

svr = SVC()
clf = GridSearchCV(svr, parameters, n_jobs=-1, verbose=3, scoring="f1_micro")
clf.fit(X_train, y_train)









    



Fitting 3 folds for each of 60 candidates, totalling 180 fits






    



[Parallel(n_jobs=-1)]: Done  80 tasks      | elapsed:   42.2s
[Parallel(n_jobs=-1)]: Done 180 out of 180 | elapsed:  2.3min finished






    Out[24]:





GridSearchCV(cv=None, error_score='raise',
       estimator=SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False),
       fit_params={}, iid=True, n_jobs=-1,
       param_grid={'gamma': [0.0001, 0.001, 0.01, 0.1, 1, 10], 'kernel': ['rbf'], 'C': [0.01, 1, 5, 10, 20, 50, 100, 1000, 5000, 10000]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring='f1_micro', verbose=3)

Let's plot the GridSearchCV results in a heatmap



In [25]:

    
cv_results = {"C": clf.cv_results_["param_C"], "gamma": clf.cv_results_["param_gamma"], 
              "Score": clf.cv_results_['mean_test_score']}
cv_results = pd.DataFrame(cv_results)
cv_results = cv_results[cv_results.columns].astype(float)
cv_results = cv_results.pivot("C", "gamma", "Score")
plt.figure(figsize=(10, 8))
sns.heatmap(cv_results, annot=True, square=True, cmap="YlGnBu", fmt='.3g')
plt.title('F1 Score');

C = 1000 and gamma = 1 seem to give the best F1 score. Let's try using these against the test dataset



In [26]:

    
clf_svm = SVC(C=1000, gamma=1)
clf_svm.fit(X_train, y_train)
predicted_labels = clf_svm.predict(X_test)

conf = confusion_matrix(y_test, predicted_labels)
display_cm(conf, facies_labels, display_metrics=True, hide_zeros=True)









    



     Pred    SS  CSiS  FSiS  SiSh    MS    WS     D    PS    BS Total
     True
       SS   134           1                                       135
     CSiS     5   326    11     1                                 343
     FSiS          14   270     4     1                           289
     SiSh           1     4    95     5                       1   106
       MS     1                 4    98     7                     110
       WS                             9   206    13     1         229
        D                             1    16    56     6     3    82
       PS                                   3     8   213     4   228
       BS                                         2     6    83    91

Precision  0.96  0.96  0.94  0.91  0.86  0.89  0.71  0.94  0.91  0.92
   Recall  0.99  0.95  0.93  0.90  0.89  0.90  0.68  0.93  0.91  0.92
       F1  0.97  0.95  0.94  0.90  0.88  0.89  0.70  0.94  0.91  0.92

DecisionTree classifier

Decision trees are very powerful for high entropy datasets however they have an tendancy to overfit. One way of controlling this is by limiting the depth of the tree using the max_depth parameter. We can use GridSearchCV for this too.



In [27]:

    
from sklearn import tree

parameters = {'max_depth': np.arange(2, 35)}
clf_dt = tree.DecisionTreeClassifier()
clf = GridSearchCV(clf_dt, parameters, n_jobs=-1, verbose=3, scoring="f1_micro")
clf.fit(X_train, y_train)









    



Fitting 3 folds for each of 33 candidates, totalling 99 fits






    



[Parallel(n_jobs=-1)]: Done  86 out of  99 | elapsed:    7.1s remaining:    1.0s
[Parallel(n_jobs=-1)]: Done  99 out of  99 | elapsed:    7.1s finished






    Out[27]:





GridSearchCV(cv=None, error_score='raise',
       estimator=DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best'),
       fit_params={}, iid=True, n_jobs=-1,
       param_grid={'max_depth': array([ 2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17, 18,
       19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34])},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring='f1_micro', verbose=3)



In [28]:

    
pd.DataFrame(clf.cv_results_).sort_values(by="rank_test_score").head()









    Out[28]:






  
    
      
      mean_fit_time
      mean_score_time
      mean_test_score
      mean_train_score
      param_max_depth
      params
      rank_test_score
      split0_test_score
      split0_train_score
      split1_test_score
      split1_train_score
      split2_test_score
      split2_train_score
      std_fit_time
      std_score_time
      std_test_score
      std_train_score
    
  
  
    
      29
      0.105011
      0.002334
      0.819629
      0.999759
      31
      {'max_depth': 31}
      1
      0.818013
      0.99969
      0.818407
      0.99969
      0.822471
      0.999897
      0.003559
      4.714266e-04
      0.002015
      0.000098
    
    
      24
      0.108011
      0.003000
      0.819354
      0.999552
      26
      {'max_depth': 26}
      2
      0.820078
      0.99969
      0.814685
      0.99969
      0.823298
      0.999277
      0.009900
      1.123916e-07
      0.003553
      0.000195
    
    
      32
      0.093343
      0.002000
      0.819147
      0.999759
      34
      {'max_depth': 34}
      3
      0.822351
      0.99969
      0.812203
      0.99969
      0.822884
      0.999897
      0.012473
      1.123916e-07
      0.004914
      0.000098
    
    
      23
      0.111678
      0.003000
      0.818733
      0.999449
      25
      {'max_depth': 25}
      4
      0.822144
      0.99969
      0.812410
      0.99969
      0.821643
      0.998967
      0.003092
      0.000000e+00
      0.004475
      0.000341
    
    
      27
      0.108011
      0.002667
      0.817768
      0.999759
      29
      {'max_depth': 29}
      5
      0.817393
      0.99969
      0.815305
      0.99969
      0.820608
      0.999897
      0.000817
      4.715390e-04
      0.002181
      0.000098



In [29]:

    
clf_dt = tree.DecisionTreeClassifier(max_depth=32)
clf_dt.fit(X_train, y_train)
predicted_labels = clf_dt.predict(X_test)

conf = confusion_matrix(y_test, predicted_labels)
display_cm(conf, facies_labels, display_metrics=True, hide_zeros=True)









    



     Pred    SS  CSiS  FSiS  SiSh    MS    WS     D    PS    BS Total
     True
       SS   122     7     5                 1                     135
     CSiS    13   306    24                                       343
     FSiS     1    25   258     1     1     2                 1   289
     SiSh           2     5    89     4     4           2         106
       MS     1                10    87    10           2         110
       WS     1                 1    17   183    18     8     1   229
        D                 1           1    11    57     8     4    82
       PS           1           3     5     9    12   192     6   228
       BS                                   2     3     8    78    91

Precision  0.88  0.90  0.88  0.86  0.76  0.82  0.63  0.87  0.87  0.85
   Recall  0.90  0.89  0.89  0.84  0.79  0.80  0.70  0.84  0.86  0.85
       F1  0.89  0.89  0.89  0.85  0.77  0.81  0.66  0.86  0.86  0.85

XGBoost classifier



In [30]:

    
import xgboost as xgb

clf_xgb = xgb.XGBClassifier()
clf_xgb.fit(X_train, y_train)
predicted_labels = clf_xgb.predict(X_test)

conf = confusion_matrix(y_test, predicted_labels)
display_cm(conf, facies_labels, display_metrics=True, hide_zeros=True)









    



     Pred    SS  CSiS  FSiS  SiSh    MS    WS     D    PS    BS Total
     True
       SS    88    40     6                 1                     135
     CSiS    18   267    56           1     1                     343
     FSiS     2    75   208     1           3                     289
     SiSh           2     8    76     2    16     2               106
       MS           2     1    10    44    44     3     6         110
       WS           1     1    13    12   169     3    27     3   229
        D                 1     4     3    18    34    20     2    82
       PS                 1     9     3    26    10   174     5   228
       BS                       1           4     1    18    67    91

Precision  0.81  0.69  0.74  0.67  0.68  0.60  0.64  0.71  0.87  0.70
   Recall  0.65  0.78  0.72  0.72  0.40  0.74  0.41  0.76  0.74  0.70
       F1  0.72  0.73  0.73  0.69  0.50  0.66  0.50  0.74  0.80  0.69

Naive Bayes classifier



In [31]:

    
from sklearn.naive_bayes import GaussianNB

clf_nb = GaussianNB()
clf_nb.fit(X_train, y_train)
predicted_labels = clf_nb.predict(X_test)

conf = confusion_matrix(y_test, predicted_labels)
display_cm(conf, facies_labels, display_metrics=True, hide_zeros=True)









    



     Pred    SS  CSiS  FSiS  SiSh    MS    WS     D    PS    BS Total
     True
       SS    98    34     2                 1                     135
     CSiS    65   245    28     2           1     2               343
     FSiS    13   181    88     4           1                 2   289
     SiSh           7     1    33          47    16     2         106
       MS           3           9     1    90     6     1         110
       WS           2           5     6   188    11    17         229
        D                 1     8          25    31    14     3    82
       PS           1           5          99    12    95    16   228
       BS                       2           7     2    31    49    91

Precision  0.56  0.52  0.73  0.49  0.14  0.41  0.39  0.59  0.70  0.53
   Recall  0.73  0.71  0.30  0.31  0.01  0.82  0.38  0.42  0.54  0.51
       F1  0.63  0.60  0.43  0.38  0.02  0.55  0.38  0.49  0.61  0.48

AdaBoost classifier



In [32]:

    
from sklearn.ensemble import AdaBoostClassifier

clf_ab = AdaBoostClassifier(n_estimators=100)
clf_ab.fit(X_train, y_train)
predicted_labels = clf_ab.predict(X_test)

conf = confusion_matrix(y_test, predicted_labels)
display_cm(conf, facies_labels, display_metrics=True, hide_zeros=True)









    



     Pred    SS  CSiS  FSiS  SiSh    MS    WS     D    PS    BS Total
     True
       SS   104    29     1     1                                 135
     CSiS    98   191    48           1     2     2     1         343
     FSiS    25   165    86     2     4     4           3         289
     SiSh     1     4     3    36     6    21     4    31         106
       MS           2     1     6     2    77     2    20         110
       WS                 2    11     5   139     3    65     4   229
        D           1     1     4     3    33     7    32     1    82
       PS           1     1           3    59    14   149     1   228
       BS                                  19    23    49          91

Precision  0.46  0.49  0.60  0.60  0.08  0.39  0.13  0.43  0.00  0.42
   Recall  0.77  0.56  0.30  0.34  0.02  0.61  0.09  0.65  0.00  0.44
       F1  0.57  0.52  0.40  0.43  0.03  0.48  0.10  0.52  0.00  0.41






    



C:\Users\sayea5\iPython Notebooks\2016-ml-contest-master\Anjum48\classification_utilities.py:15: RuntimeWarning: invalid value encountered in true_divide
  F1 = 2 * (precision * recall) / (precision + recall)

Random Forest classifier



In [33]:

    
from sklearn.ensemble import RandomForestClassifier

clf_rf = RandomForestClassifier(n_estimators=10, max_depth=None, min_samples_split=2, random_state=0)
clf_rf.fit(X_train, y_train)
predicted_labels = clf_rf.predict(X_test)

conf = confusion_matrix(y_test, predicted_labels)
display_cm(conf, facies_labels, display_metrics=True, hide_zeros=True)









    



     Pred    SS  CSiS  FSiS  SiSh    MS    WS     D    PS    BS Total
     True
       SS   130     3     1                 1                     135
     CSiS     5   322    16                                       343
     FSiS          29   257     2           1                     289
     SiSh           1     3    95     5     2                     106
       MS                 1     8    87    12     1     1         110
       WS           1           5    12   202     5     4         229
        D                             3    22    50     5     2    82
       PS           1                 1     7     3   212     4   228
       BS                             1     1     2     6    81    91

Precision  0.96  0.90  0.92  0.86  0.80  0.81  0.82  0.93  0.93  0.89
   Recall  0.96  0.94  0.89  0.90  0.79  0.88  0.61  0.93  0.89  0.89
       F1  0.96  0.92  0.91  0.88  0.79  0.85  0.70  0.93  0.91  0.89

Nearest Neighbours classifier

This classifer has two parameters that can be tuned: the number of neighbours to use, and the method of weighting the neighbours (uniform or distance based). We'll use GridSearchCV again to find the optimal set.



In [34]:

    
from sklearn import neighbors

parameters = {'n_neighbors': np.arange(1, 25), 'weights': ['uniform', 'distance']}
knn = neighbors.KNeighborsClassifier()
clf = GridSearchCV(knn, parameters, n_jobs=-1, verbose=3, scoring="f1_micro")
clf.fit(X_train, y_train)









    



Fitting 3 folds for each of 48 candidates, totalling 144 fits






    



[Parallel(n_jobs=-1)]: Done  80 tasks      | elapsed:    7.0s
[Parallel(n_jobs=-1)]: Done 144 out of 144 | elapsed:    9.1s finished






    Out[34]:





GridSearchCV(cv=None, error_score='raise',
       estimator=KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform'),
       fit_params={}, iid=True, n_jobs=-1,
       param_grid={'weights': ['uniform', 'distance'], 'n_neighbors': array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
       18, 19, 20, 21, 22, 23, 24])},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring='f1_micro', verbose=3)



In [35]:

    
pd.DataFrame(clf.cv_results_).sort_values(by="rank_test_score").head(10)









    Out[35]:






  
    
      
      mean_fit_time
      mean_score_time
      mean_test_score
      mean_train_score
      param_n_neighbors
      param_weights
      params
      rank_test_score
      split0_test_score
      split0_train_score
      split1_test_score
      split1_train_score
      split2_test_score
      split2_train_score
      std_fit_time
      std_score_time
      std_test_score
      std_train_score
    
  
  
    
      0
      0.011668
      0.036670
      0.907712
      0.999759
      1
      uniform
      {'weights': 'uniform', 'n_neighbors': 1}
      1
      0.905185
      0.999690
      0.903619
      0.999690
      0.914339
      0.999897
      1.699846e-03
      4.110098e-03
      0.004727
      0.000098
    
    
      1
      0.010001
      0.031670
      0.907712
      0.999759
      1
      distance
      {'weights': 'distance', 'n_neighbors': 1}
      1
      0.905185
      0.999690
      0.903619
      0.999690
      0.914339
      0.999897
      0.000000e+00
      4.713704e-04
      0.004727
      0.000098
    
    
      3
      0.010001
      0.044004
      0.907644
      0.999759
      2
      distance
      {'weights': 'distance', 'n_neighbors': 2}
      3
      0.905185
      0.999690
      0.903413
      0.999690
      0.914339
      0.999897
      0.000000e+00
      0.000000e+00
      0.004787
      0.000098
    
    
      5
      0.010001
      0.057006
      0.898270
      0.999759
      3
      distance
      {'weights': 'distance', 'n_neighbors': 3}
      4
      0.894650
      0.999690
      0.893071
      0.999690
      0.907097
      0.999897
      0.000000e+00
      1.123916e-07
      0.006272
      0.000098
    
    
      7
      0.010001
      0.067340
      0.896202
      0.999759
      4
      distance
      {'weights': 'distance', 'n_neighbors': 4}
      5
      0.894856
      0.999690
      0.889969
      0.999690
      0.903786
      0.999897
      0.000000e+00
      9.429093e-04
      0.005719
      0.000098
    
    
      9
      0.010334
      0.100343
      0.889724
      0.999759
      5
      distance
      {'weights': 'distance', 'n_neighbors': 5}
      6
      0.889899
      0.999690
      0.882937
      0.999690
      0.896338
      0.999897
      4.714266e-04
      3.301170e-02
      0.005471
      0.000098
    
    
      11
      0.010001
      0.142681
      0.886622
      0.999759
      6
      distance
      {'weights': 'distance', 'n_neighbors': 6}
      7
      0.885354
      0.999690
      0.879421
      0.999690
      0.895096
      0.999897
      1.123916e-07
      4.084800e-02
      0.006460
      0.000098
    
    
      2
      0.010668
      0.043004
      0.884830
      0.951582
      2
      uniform
      {'weights': 'uniform', 'n_neighbors': 2}
      8
      0.882049
      0.952731
      0.886660
      0.950589
      0.885785
      0.951426
      4.713142e-04
      8.164374e-04
      0.002000
      0.000881
    
    
      13
      0.011334
      0.155015
      0.881591
      0.999759
      7
      distance
      {'weights': 'distance', 'n_neighbors': 7}
      9
      0.879777
      0.999690
      0.874457
      0.999690
      0.890544
      0.999897
      1.885762e-03
      3.001412e-02
      0.006690
      0.000098
    
    
      4
      0.010334
      0.056006
      0.880695
      0.944690
      3
      uniform
      {'weights': 'uniform', 'n_neighbors': 3}
      10
      0.879364
      0.948283
      0.875284
      0.942526
      0.887441
      0.943262
      4.714266e-04
      8.165347e-04
      0.005050
      0.002558

It looks like several different parameter combinations give the same score. We'll pick the one using the most neighbours: n_neighbors=1 and weights="distance" (the latter has no effect with k=1)



In [36]:

    
clf_knn = neighbors.KNeighborsClassifier(1, weights="distance")
clf_knn.fit(X_train, y_train)
predicted_labels = clf_knn.predict(X_test)

conf = confusion_matrix(y_test, predicted_labels)
display_cm(conf, facies_labels, display_metrics=True, hide_zeros=True)









    



     Pred    SS  CSiS  FSiS  SiSh    MS    WS     D    PS    BS Total
     True
       SS   133     2                                             135
     CSiS     6   324    12     1                                 343
     FSiS          14   269     4     1     1                     289
     SiSh           1     6    95     3                       1   106
       MS                 1     2    95    12                     110
       WS                       1     9   206    12     1         229
        D                       1          11    58     9     3    82
       PS                             1     1     9   212     5   228
       BS                                         1     5    85    91

Precision  0.96  0.95  0.93  0.91  0.87  0.89  0.72  0.93  0.90  0.92
   Recall  0.99  0.94  0.93  0.90  0.86  0.90  0.71  0.93  0.93  0.92
       F1  0.97  0.95  0.93  0.90  0.87  0.90  0.72  0.93  0.92  0.92

TensorFlow DNN



In [37]:

    
import tensorflow as tf
from tensorflow.contrib import learn
from tensorflow.contrib import layers

feature_columns = learn.infer_real_valued_columns_from_input(X_train)
clf_dnn = learn.DNNClassifier(feature_columns=feature_columns, hidden_units=[200, 400, 200], n_classes=10)

clf_dnn.fit(X_train, y_train, steps=5000)
predicted_labels = list(clf_dnn.predict(X_test, as_iterable=True))

conf = confusion_matrix(y_test, predicted_labels)
display_cm(conf, facies_labels, display_metrics=True, hide_zeros=True)









    



WARNING:tensorflow:float64 is not supported by many models, consider casting to float32.
WARNING:tensorflow:Using temporary folder as model directory: C:\Users\sayea5\AppData\Local\Temp\tmp92cd3dsv
INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'keep_checkpoint_every_n_hours': 10000, '_num_ps_replicas': 0, 'tf_random_seed': None, 'save_summary_steps': 100, 'tf_config': gpu_options {
  per_process_gpu_memory_fraction: 1
}
, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x0000000012756438>, '_master': '', '_task_id': 0, '_task_type': None, 'save_checkpoints_secs': 600, '_is_chief': True, 'save_checkpoints_steps': None, '_environment': 'local', '_evaluation_master': '', 'keep_checkpoint_max': 5}
WARNING:tensorflow:From C:\Users\sayea5\AppData\Local\Continuum\Anaconda3\lib\site-packages\tensorflow\contrib\learn\python\learn\estimators\dnn.py:315 in fit.: calling BaseEstimator.fit (from tensorflow.contrib.learn.python.learn.estimators.estimator) with x is deprecated and will be removed after 2016-12-01.
Instructions for updating:
Estimator is decoupled from Scikit Learn interface by moving into
separate class SKCompat. Arguments x, y and batch_size are only
available in the SKCompat class, Estimator will only accept input_fn.
Example conversion:
  est = Estimator(...) -> est = SKCompat(Estimator(...))
WARNING:tensorflow:From C:\Users\sayea5\AppData\Local\Continuum\Anaconda3\lib\site-packages\tensorflow\contrib\learn\python\learn\estimators\dnn.py:315 in fit.: calling BaseEstimator.fit (from tensorflow.contrib.learn.python.learn.estimators.estimator) with y is deprecated and will be removed after 2016-12-01.
Instructions for updating:
Estimator is decoupled from Scikit Learn interface by moving into
separate class SKCompat. Arguments x, y and batch_size are only
available in the SKCompat class, Estimator will only accept input_fn.
Example conversion:
  est = Estimator(...) -> est = SKCompat(Estimator(...))
WARNING:tensorflow:From C:\Users\sayea5\AppData\Local\Continuum\Anaconda3\lib\site-packages\tensorflow\contrib\learn\python\learn\estimators\dnn.py:315 in fit.: calling BaseEstimator.fit (from tensorflow.contrib.learn.python.learn.estimators.estimator) with batch_size is deprecated and will be removed after 2016-12-01.
Instructions for updating:
Estimator is decoupled from Scikit Learn interface by moving into
separate class SKCompat. Arguments x, y and batch_size are only
available in the SKCompat class, Estimator will only accept input_fn.
Example conversion:
  est = Estimator(...) -> est = SKCompat(Estimator(...))
WARNING:tensorflow:float64 is not supported by many models, consider casting to float32.
INFO:tensorflow:Summary name dnn/hiddenlayer_0:fraction_of_zero_values is illegal; using dnn/hiddenlayer_0_fraction_of_zero_values instead.
INFO:tensorflow:Summary name dnn/hiddenlayer_0:activation is illegal; using dnn/hiddenlayer_0_activation instead.
INFO:tensorflow:Summary name dnn/hiddenlayer_1:fraction_of_zero_values is illegal; using dnn/hiddenlayer_1_fraction_of_zero_values instead.
INFO:tensorflow:Summary name dnn/hiddenlayer_1:activation is illegal; using dnn/hiddenlayer_1_activation instead.
INFO:tensorflow:Summary name dnn/hiddenlayer_2:fraction_of_zero_values is illegal; using dnn/hiddenlayer_2_fraction_of_zero_values instead.
INFO:tensorflow:Summary name dnn/hiddenlayer_2:activation is illegal; using dnn/hiddenlayer_2_activation instead.
INFO:tensorflow:Summary name dnn/logits:fraction_of_zero_values is illegal; using dnn/logits_fraction_of_zero_values instead.
INFO:tensorflow:Summary name dnn/logits:activation is illegal; using dnn/logits_activation instead.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:loss = 2.32257, step = 1
INFO:tensorflow:Saving checkpoints for 1 into C:\Users\sayea5\AppData\Local\Temp\tmp92cd3dsv\model.ckpt.
WARNING:tensorflow:*******************************************************
WARNING:tensorflow:TensorFlow's V1 checkpoint format has been deprecated.
WARNING:tensorflow:Consider switching to the more efficient V2 format:
WARNING:tensorflow:   `tf.train.Saver(write_version=tf.train.SaverDef.V2)`
WARNING:tensorflow:now on by default.
WARNING:tensorflow:*******************************************************
INFO:tensorflow:loss = 1.07682, step = 101
INFO:tensorflow:global_step/sec: 4.72119
INFO:tensorflow:loss = 1.01264, step = 201
INFO:tensorflow:global_step/sec: 4.89044
INFO:tensorflow:loss = 0.973991, step = 301
INFO:tensorflow:global_step/sec: 4.89523
INFO:tensorflow:loss = 0.941887, step = 401
INFO:tensorflow:global_step/sec: 4.91424
INFO:tensorflow:loss = 0.913601, step = 501
INFO:tensorflow:global_step/sec: 4.90917
INFO:tensorflow:loss = 0.887817, step = 601
INFO:tensorflow:global_step/sec: 4.92271
INFO:tensorflow:loss = 0.862841, step = 701
INFO:tensorflow:global_step/sec: 4.88089
INFO:tensorflow:loss = 0.839388, step = 801
INFO:tensorflow:global_step/sec: 4.9198
INFO:tensorflow:loss = 0.818133, step = 901
INFO:tensorflow:global_step/sec: 4.89667
INFO:tensorflow:loss = 0.797668, step = 1001
INFO:tensorflow:global_step/sec: 4.89715
INFO:tensorflow:loss = 0.788319, step = 1101
INFO:tensorflow:global_step/sec: 4.90628
INFO:tensorflow:loss = 0.758846, step = 1201
INFO:tensorflow:global_step/sec: 4.80999
INFO:tensorflow:loss = 0.748113, step = 1301
INFO:tensorflow:global_step/sec: 4.90363
INFO:tensorflow:loss = 0.723232, step = 1401
INFO:tensorflow:global_step/sec: 4.87115
INFO:tensorflow:loss = 0.704529, step = 1501
INFO:tensorflow:global_step/sec: 4.89955
INFO:tensorflow:loss = 0.690109, step = 1601
INFO:tensorflow:global_step/sec: 4.9046
INFO:tensorflow:loss = 0.680026, step = 1701
INFO:tensorflow:global_step/sec: 4.88256
INFO:tensorflow:loss = 0.660257, step = 1801
INFO:tensorflow:global_step/sec: 4.90291
INFO:tensorflow:loss = 0.65416, step = 1901
INFO:tensorflow:global_step/sec: 4.89715
INFO:tensorflow:loss = 0.662714, step = 2001
INFO:tensorflow:global_step/sec: 4.91617
INFO:tensorflow:loss = 0.607283, step = 2101
INFO:tensorflow:global_step/sec: 4.86996
INFO:tensorflow:loss = 0.617725, step = 2201
INFO:tensorflow:global_step/sec: 4.88614
INFO:tensorflow:loss = 0.614804, step = 2301
INFO:tensorflow:global_step/sec: 4.91617
INFO:tensorflow:loss = 0.623854, step = 2401
INFO:tensorflow:global_step/sec: 4.8281
INFO:tensorflow:loss = 0.560081, step = 2501
INFO:tensorflow:global_step/sec: 4.79706
INFO:tensorflow:loss = 0.556521, step = 2601
INFO:tensorflow:global_step/sec: 4.83487
INFO:tensorflow:loss = 0.522673, step = 2701
INFO:tensorflow:global_step/sec: 4.89571
INFO:tensorflow:loss = 0.538951, step = 2801
INFO:tensorflow:global_step/sec: 4.87899
INFO:tensorflow:loss = 0.536409, step = 2901
INFO:tensorflow:global_step/sec: 4.87923
INFO:tensorflow:Saving checkpoints for 2930 into C:\Users\sayea5\AppData\Local\Temp\tmp92cd3dsv\model.ckpt.
WARNING:tensorflow:*******************************************************
WARNING:tensorflow:TensorFlow's V1 checkpoint format has been deprecated.
WARNING:tensorflow:Consider switching to the more efficient V2 format:
WARNING:tensorflow:   `tf.train.Saver(write_version=tf.train.SaverDef.V2)`
WARNING:tensorflow:now on by default.
WARNING:tensorflow:*******************************************************
INFO:tensorflow:loss = 0.512003, step = 3001
INFO:tensorflow:global_step/sec: 4.74831
INFO:tensorflow:loss = 0.516283, step = 3101
INFO:tensorflow:global_step/sec: 4.78055
INFO:tensorflow:loss = 0.505689, step = 3201
INFO:tensorflow:global_step/sec: 4.86096
INFO:tensorflow:loss = 0.459252, step = 3301
INFO:tensorflow:global_step/sec: 4.92077
INFO:tensorflow:loss = 0.489185, step = 3401
INFO:tensorflow:global_step/sec: 4.90556
INFO:tensorflow:loss = 0.458353, step = 3501
INFO:tensorflow:global_step/sec: 4.92004
INFO:tensorflow:loss = 0.471119, step = 3601
INFO:tensorflow:global_step/sec: 4.93042
INFO:tensorflow:loss = 0.476558, step = 3701
INFO:tensorflow:global_step/sec: 4.92743
INFO:tensorflow:loss = 0.421658, step = 3801
INFO:tensorflow:global_step/sec: 4.88584
INFO:tensorflow:loss = 0.468468, step = 3901
INFO:tensorflow:global_step/sec: 4.83232
INFO:tensorflow:loss = 0.442959, step = 4001
INFO:tensorflow:global_step/sec: 4.81096
INFO:tensorflow:loss = 0.385229, step = 4101
INFO:tensorflow:global_step/sec: 4.92961
INFO:tensorflow:loss = 0.41467, step = 4201
INFO:tensorflow:global_step/sec: 4.89971
INFO:tensorflow:loss = 0.405952, step = 4301
INFO:tensorflow:global_step/sec: 4.873
INFO:tensorflow:loss = 0.370555, step = 4401
INFO:tensorflow:global_step/sec: 4.9098
INFO:tensorflow:loss = 0.373946, step = 4501
INFO:tensorflow:global_step/sec: 4.90787
INFO:tensorflow:loss = 0.377228, step = 4601
INFO:tensorflow:global_step/sec: 4.8806
INFO:tensorflow:loss = 0.455364, step = 4701
INFO:tensorflow:global_step/sec: 4.89014
INFO:tensorflow:loss = 0.328196, step = 4801
INFO:tensorflow:global_step/sec: 4.88703
INFO:tensorflow:loss = 0.378045, step = 4901
INFO:tensorflow:global_step/sec: 4.86897
INFO:tensorflow:Saving checkpoints for 5000 into C:\Users\sayea5\AppData\Local\Temp\tmp92cd3dsv\model.ckpt.
WARNING:tensorflow:*******************************************************
WARNING:tensorflow:TensorFlow's V1 checkpoint format has been deprecated.
WARNING:tensorflow:Consider switching to the more efficient V2 format:
WARNING:tensorflow:   `tf.train.Saver(write_version=tf.train.SaverDef.V2)`
WARNING:tensorflow:now on by default.
WARNING:tensorflow:*******************************************************
INFO:tensorflow:Loss for final step: 0.32663.
WARNING:tensorflow:From C:\Users\sayea5\AppData\Local\Continuum\Anaconda3\lib\site-packages\tensorflow\contrib\learn\python\learn\estimators\dnn.py:348 in predict.: calling BaseEstimator.predict (from tensorflow.contrib.learn.python.learn.estimators.estimator) with x is deprecated and will be removed after 2016-12-01.
Instructions for updating:
Estimator is decoupled from Scikit Learn interface by moving into
separate class SKCompat. Arguments x, y and batch_size are only
available in the SKCompat class, Estimator will only accept input_fn.
Example conversion:
  est = Estimator(...) -> est = SKCompat(Estimator(...))
WARNING:tensorflow:From C:\Users\sayea5\AppData\Local\Continuum\Anaconda3\lib\site-packages\tensorflow\contrib\learn\python\learn\estimators\dnn.py:348 in predict.: calling BaseEstimator.predict (from tensorflow.contrib.learn.python.learn.estimators.estimator) with batch_size is deprecated and will be removed after 2016-12-01.
Instructions for updating:
Estimator is decoupled from Scikit Learn interface by moving into
separate class SKCompat. Arguments x, y and batch_size are only
available in the SKCompat class, Estimator will only accept input_fn.
Example conversion:
  est = Estimator(...) -> est = SKCompat(Estimator(...))
WARNING:tensorflow:From C:\Users\sayea5\AppData\Local\Continuum\Anaconda3\lib\site-packages\tensorflow\contrib\learn\python\learn\estimators\dnn.py:348 in predict.: calling BaseEstimator.predict (from tensorflow.contrib.learn.python.learn.estimators.estimator) with as_iterable is deprecated and will be removed after 2016-12-01.
Instructions for updating:
Estimator is decoupled from Scikit Learn interface by moving into
separate class SKCompat. Arguments x, y and batch_size are only
available in the SKCompat class, Estimator will only accept input_fn.
Example conversion:
  est = Estimator(...) -> est = SKCompat(Estimator(...))
WARNING:tensorflow:float64 is not supported by many models, consider casting to float32.
INFO:tensorflow:Summary name dnn/hiddenlayer_0:fraction_of_zero_values is illegal; using dnn/hiddenlayer_0_fraction_of_zero_values instead.
INFO:tensorflow:Summary name dnn/hiddenlayer_0:activation is illegal; using dnn/hiddenlayer_0_activation instead.
INFO:tensorflow:Summary name dnn/hiddenlayer_1:fraction_of_zero_values is illegal; using dnn/hiddenlayer_1_fraction_of_zero_values instead.
INFO:tensorflow:Summary name dnn/hiddenlayer_1:activation is illegal; using dnn/hiddenlayer_1_activation instead.
INFO:tensorflow:Summary name dnn/hiddenlayer_2:fraction_of_zero_values is illegal; using dnn/hiddenlayer_2_fraction_of_zero_values instead.
INFO:tensorflow:Summary name dnn/hiddenlayer_2:activation is illegal; using dnn/hiddenlayer_2_activation instead.
INFO:tensorflow:Summary name dnn/logits:fraction_of_zero_values is illegal; using dnn/logits_fraction_of_zero_values instead.
INFO:tensorflow:Summary name dnn/logits:activation is illegal; using dnn/logits_activation instead.
INFO:tensorflow:Loading model from checkpoint: C:\Users\sayea5\AppData\Local\Temp\tmp92cd3dsv\model.ckpt-5000-?????-of-00001.
     Pred    SS  CSiS  FSiS  SiSh    MS    WS     D    PS    BS Total
     True
       SS   118    14     2                 1                     135
     CSiS     2   309    32                                       343
     FSiS     2    25   259     2                       1         289
     SiSh           2     9    92           2           1         106
       MS                 3    16    65     8          18         110
       WS                 1     9     2   175     8    34         229
        D                       1           8    50    20     3    82
       PS                 1     2     1     2     3   216     3   228
       BS                                         2    10    79    91

Precision  0.97  0.88  0.84  0.75  0.96  0.89  0.79  0.72  0.93  0.86
   Recall  0.87  0.90  0.90  0.87  0.59  0.76  0.61  0.95  0.87  0.85
       F1  0.92  0.89  0.87  0.81  0.73  0.82  0.69  0.82  0.90  0.84

Ensemble classifier

The best classifiers so far are SVM, DecisionTree, KNN, RandomForest and DNN. Lets put all 5 together and take a majority vote to decide the final classification. Due to the way TensorFlow estimators work, they are currently incompatible with the VotingClassifier in scikit-learn, so we'll do the majority vote using the modal value calculated in a pandas Dataframe.

Update In practice, using XGBoost instead of the DNN gave slightly better F1 scores.



In [38]:

    
from sklearn.ensemble import VotingClassifier

eclf = VotingClassifier(estimators=[('SVM', clf_svm), 
                                    ('DecisionTree', clf_dt), 
                                    ('KNN', clf_knn), 
                                    ('RandomForest', clf_rf), 
                                    ('XGBoost', clf_xgb)
                                   ], 
                        voting='hard')

eclf.fit(X_train, y_train)
predicted_labels = eclf.predict(X_test)

conf = confusion_matrix(y_test, predicted_labels)
display_cm(conf, facies_labels, display_metrics=True, hide_zeros=True)









    



     Pred    SS  CSiS  FSiS  SiSh    MS    WS     D    PS    BS Total
     True
       SS   131     2     1                 1                     135
     CSiS     3   330    10                                       343
     FSiS          21   265     2           1                     289
     SiSh           1     6    96     2     1                     106
       MS                 1     8    91    10                     110
       WS                       3     8   208     9     1         229
        D                                  18    54     7     3    82
       PS           1                       3     6   214     4   228
       BS                                   1     2     7    81    91

Precision  0.98  0.93  0.94  0.88  0.90  0.86  0.76  0.93  0.92  0.91
   Recall  0.97  0.96  0.92  0.91  0.83  0.91  0.66  0.94  0.89  0.91
       F1  0.97  0.95  0.93  0.89  0.86  0.88  0.71  0.94  0.91  0.91



In [39]:

    
display_adj_cm(conf, facies_labels, adjacent_facies, display_metrics=True, hide_zeros=True)









    



     Pred    SS  CSiS  FSiS  SiSh    MS    WS     D    PS    BS Total
     True
       SS   133           1                 1                     135
     CSiS         343                                             343
     FSiS               286     2           1                     289
     SiSh           1     6    98           1                     106
       MS                 1         109                           110
       WS                       3         226                     229
        D                                        79           3    82
       PS           1                                 227         228
       BS                                   1                90    91

Precision  1.00  0.99  0.97  0.95  1.00  0.98  1.00  1.00  0.97  0.99
   Recall  0.99  1.00  0.99  0.92  0.99  0.99  0.96  1.00  0.99  0.99
       F1  0.99  1.00  0.98  0.94  1.00  0.98  0.98  1.00  0.98  0.99

Use a homemade majority voting to use with TensorFlow



In [40]:

    
# classifiers = {
#     "SVM": SVC(C=1000, gamma=1),
#     "DecisionTree": tree.DecisionTreeClassifier(max_depth=32),
#     "KNN": neighbors.KNeighborsClassifier(1, weights="distance"),
# #     "DNN": learn.DNNClassifier(feature_columns=feature_columns, hidden_units=[200, 400, 200], n_classes=10),
#     "RandomForest": RandomForestClassifier(n_estimators=10, max_depth=None, min_samples_split=2, random_state=0),
#     "XGBoost": xgb.XGBClassifier()
#               }

# def fit_and_predict(X_train, y_train, X_test, classifiers):
#     predicted_values = {}
#     for key, classifier in classifiers.items():
#         if key == "DNN":
#             classifier.fit(X_train, y_train, steps=5000)
#             list(classifier.predict(X_test, as_iterable=True))
#         else:
#             classifier.fit(X_train, y_train)
#             predicted_values[key] = classifier.predict(X_test)
#     return pd.DataFrame(predicted_values)



In [41]:

    
# predicted_values = fit_and_predict(X_train, y_train, X_test, classifiers)



In [42]:

    
# majority_vote = predicted_values.mode(axis=1)[0]

# conf = confusion_matrix(y_test, majority_vote.fillna(2))
# display_cm(conf, facies_labels, display_metrics=True, hide_zeros=True)



In [43]:

    
# display_adj_cm(conf, facies_labels, adjacent_facies, display_metrics=True, hide_zeros=True)

Average test F1 score with leave one well out



In [53]:

    
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.metrics import f1_score, classification_report

f1_eclf = []
logo = LeaveOneGroupOut()

for train, test in logo.split(scaled_features, correct_facies_labels, groups=well_names):  
    eclf.fit(scaled_features[train], correct_facies_labels[train])
    pred = eclf.predict(scaled_features[test])
    sc = f1_score(correct_facies_labels[test], pred, labels=np.arange(10), average='micro')
    
    well_name = well_names[test[0]]
    print("{}  {:.3f}".format(well_name, sc))
    f1_eclf.append(sc)
    
#     conf = confusion_matrix(correct_facies_labels[test], pred)
#     display_cm(conf, facies_labels, display_metrics=True, hide_zeros=True)
#     print("")
    
print("Average leave-one-well-out F1 Score: %6f" % (sum(f1_eclf)/(1.0*(len(f1_eclf)))))









    



NEWBY  0.509
CROSS H CATTLE  0.361
LUKE G U  0.512
Recruit F9  0.393
CROSS H CATTLE  0.429
NOLAN  0.848
SHANKLE  0.422
SHRIMPLIN  0.550
Average leave-one-well-out F1 Score: 0.503202

Issue I'm not sure why the leave one well out scores are so much worse than the randomised train/test split (most likely variation between wells due to a combination of geology and well log quality). Need more work in this area

Prediction

Let's use the full dataset to train the models and make the prediction



In [45]:

    
#Load testing data and standardise
test_data = pd.read_csv('../validation_data_nofacies.csv')
test_features = test_data.drop(['Well Name', 'Depth', "Formation"], axis=1)
scaled_test_features = scaler.transform(test_features)



In [46]:

    
eclf.fit(scaled_features, correct_facies_labels)
predicted_test_labels = eclf.predict(scaled_test_features)

# Save predicted labels
test_data['Facies'] = predicted_test_labels
test_data.to_csv('Anjum48_Prediction_Submission.csv')



In [47]:

    
# predicted_test_labels = fit_and_predict(scaled_features, correct_facies_labels, 
#                                         scaled_test_features, classifiers)



In [48]:

    
# # Save predicted labels
# test_data['Facies'] = predicted_test_labels.mode(axis=1)[0]
# test_data.to_csv('Anjum48_Prediction_Submission.csv')



In [49]:

    
# Plot predicted labels
make_facies_log_plot(
    test_data[test_data['Well Name'] == 'STUART'],
    facies_colors=facies_colors)

make_facies_log_plot(
    test_data[test_data['Well Name'] == 'CRAWFORD'],
    facies_colors=facies_colors)
mpl.rcParams.update(inline_rc)

Interestingly in the test wells, there appears to be some bad data where the logs appear to be linearly interpolated, e.g. ~3025mMD in Crawford

Future work/suggestions

Get the TensorFlow models to work in the VotingClassifier
Try a LSTM model using TensorFlow and use previous depths as features
Try some of the interesting feature engineering solutions shown by others (e.g. Paolo Bestagini's work, wavelet transforms etc.)
Use a normalised GR - this requires a bit more info on the geology to find a fieldwide correlatable high (shale) and low (sand/carbonate) GR events
Check for hydrocarbon bearing intervals and split the data into gas/oil and water zones. Alternatively, calculate water saturation using Archie's equation and use that as a feature (assume m = n = 2)
Detailed log QC removing intervals that look like they have suspicious data
Try incorporating the wells with missing PEF data. Some people have used a regression technique to predict this, but given the physics of the measurement this may/may not be appropriate
Try using TPOT or tuning XGBoost

The leave one group out analysis showed a massive reduction in the F1 score. More work is needed to understand why this is, using LOGO type CV would probably be more appropriate for training models



In [ ]:

	Facies	Formation	Well Name	Depth	GR	ILD_log10	DeltaPHI	PHIND	PE	NM_M	RELPOS
0	3	A1 SH	SHRIMPLIN	2793.0	77.450	0.664	9.900	11.915	4.600	1	1.000
1	3	A1 SH	SHRIMPLIN	2793.5	78.260	0.661	14.200	12.565	4.100	1	0.979
2	3	A1 SH	SHRIMPLIN	2794.0	79.050	0.658	14.800	13.050	3.600	1	0.957
3	3	A1 SH	SHRIMPLIN	2794.5	86.100	0.655	13.900	13.115	3.500	1	0.936
4	3	A1 SH	SHRIMPLIN	2795.0	74.580	0.647	13.500	13.300	3.400	1	0.915
...	...	...	...	...	...	...	...	...	...	...	...
4144	5	C LM	CHURCHMAN BIBLE	3120.5	46.719	0.947	1.828	7.254	3.617	2	0.685
4145	5	C LM	CHURCHMAN BIBLE	3121.0	44.563	0.953	2.241	8.013	3.344	2	0.677
4146	5	C LM	CHURCHMAN BIBLE	3121.5	49.719	0.964	2.925	8.013	3.190	2	0.669
4147	5	C LM	CHURCHMAN BIBLE	3122.0	51.469	0.965	3.083	7.708	3.152	2	0.661
4148	5	C LM	CHURCHMAN BIBLE	3122.5	50.031	0.970	2.609	6.668	3.295	2	0.653

	Facies	Depth	GR	ILD_log10	DeltaPHI	PHIND	PE	NM_M	RELPOS
count	3232.000000	3232.000000	3232.000000	3232.000000	3232.000000	3232.000000	3232.000000	3232.000000	3232.000000
mean	4.422030	2875.824567	66.135769	0.642719	3.559642	13.483213	3.725014	1.498453	0.520287
std	2.504243	131.006274	30.854826	0.241845	5.228948	7.698980	0.896152	0.500075	0.286792
min	1.000000	2573.500000	13.250000	-0.025949	-21.832000	0.550000	0.200000	1.000000	0.010000
25%	2.000000	2791.000000	46.918750	0.492750	1.163750	8.346750	3.100000	1.000000	0.273000
50%	4.000000	2893.500000	65.721500	0.624437	3.500000	12.150000	3.551500	1.000000	0.526000
75%	6.000000	2980.000000	79.626250	0.812735	6.432500	16.453750	4.300000	2.000000	0.767250
max	9.000000	3122.500000	361.150000	1.480000	18.600000	84.400000	8.094000	2.000000	1.000000

	Facies	Depth	GR	ILD_log10	DeltaPHI	PHIND	PE	NM_M	RELPOS
count	16122.000000	16122.000000	16122.000000	16122.000000	16122.000000	16122.000000	16122.000000	16122.000000	16122.000000
mean	4.420233	2875.779266	66.132765	0.642385	3.562926	13.488510	3.724287	1.498325	0.520044
std	2.492889	130.886683	30.795497	0.241821	5.207364	7.684145	0.894569	0.500013	0.284700
min	1.000000	2573.500000	12.759470	-0.026553	-21.923002	-1.112217	0.058033	1.000000	-0.096527
25%	2.000000	2791.000000	46.939636	0.490246	1.144971	8.359510	3.113000	1.000000	0.279000
50%	4.000000	2893.600003	65.816922	0.624654	3.461482	12.138445	3.546221	1.000000	0.525477
75%	6.000000	2980.000000	79.864096	0.812950	6.483459	16.492567	4.306479	2.000000	0.761796
max	9.000000	3122.500000	375.701116	1.487703	20.032203	84.606003	8.094000	2.000000	1.201765

	GR	ILD_log10	DeltaPHI	PHIND	PE	NM_M	RELPOS
0	77.450000	0.664000	9.900000	11.915000	4.600000	1	1.000000
1	79.284515	0.663545	11.150359	12.015714	4.546545	1	0.996012
2	79.937114	0.662987	12.190536	12.139619	4.460793	1	0.991895
3	79.753455	0.662357	13.035533	12.278167	4.351767	1	0.987673
4	79.079198	0.661685	13.700354	12.422809	4.228495	1	0.983367
...	...	...	...	...	...	...	...
16122	51.177622	0.964821	3.026104	7.597464	3.161609	2	0.659412
16123	50.791689	0.964974	2.945037	7.451181	3.179267	2	0.657823
16124	50.407644	0.965716	2.845018	7.256165	3.206420	2	0.656228
16125	50.121933	0.967306	2.731266	6.999433	3.244515	2	0.654622
16126	50.031000	0.970000	2.609000	6.668000	3.295000	2	0.653000

	mean_fit_time	mean_score_time	mean_test_score	mean_train_score	param_max_depth	params	rank_test_score	split0_test_score	split0_train_score	split1_test_score	split1_train_score	split2_test_score	split2_train_score	std_fit_time	std_score_time	std_test_score	std_train_score
29	0.105011	0.002334	0.819629	0.999759	31	{'max_depth': 31}	1	0.818013	0.99969	0.818407	0.99969	0.822471	0.999897	0.003559	4.714266e-04	0.002015	0.000098
24	0.108011	0.003000	0.819354	0.999552	26	{'max_depth': 26}	2	0.820078	0.99969	0.814685	0.99969	0.823298	0.999277	0.009900	1.123916e-07	0.003553	0.000195
32	0.093343	0.002000	0.819147	0.999759	34	{'max_depth': 34}	3	0.822351	0.99969	0.812203	0.99969	0.822884	0.999897	0.012473	1.123916e-07	0.004914	0.000098
23	0.111678	0.003000	0.818733	0.999449	25	{'max_depth': 25}	4	0.822144	0.99969	0.812410	0.99969	0.821643	0.998967	0.003092	0.000000e+00	0.004475	0.000341
27	0.108011	0.002667	0.817768	0.999759	29	{'max_depth': 29}	5	0.817393	0.99969	0.815305	0.99969	0.820608	0.999897	0.000817	4.715390e-04	0.002181	0.000098

	mean_fit_time	mean_score_time	mean_test_score	mean_train_score	param_n_neighbors	param_weights	params	rank_test_score	split0_test_score	split0_train_score	split1_test_score	split1_train_score	split2_test_score	split2_train_score	std_fit_time	std_score_time	std_test_score	std_train_score
0	0.011668	0.036670	0.907712	0.999759	1	uniform	{'weights': 'uniform', 'n_neighbors': 1}	1	0.905185	0.999690	0.903619	0.999690	0.914339	0.999897	1.699846e-03	4.110098e-03	0.004727	0.000098
1	0.010001	0.031670	0.907712	0.999759	1	distance	{'weights': 'distance', 'n_neighbors': 1}	1	0.905185	0.999690	0.903619	0.999690	0.914339	0.999897	0.000000e+00	4.713704e-04	0.004727	0.000098
3	0.010001	0.044004	0.907644	0.999759	2	distance	{'weights': 'distance', 'n_neighbors': 2}	3	0.905185	0.999690	0.903413	0.999690	0.914339	0.999897	0.000000e+00	0.000000e+00	0.004787	0.000098
5	0.010001	0.057006	0.898270	0.999759	3	distance	{'weights': 'distance', 'n_neighbors': 3}	4	0.894650	0.999690	0.893071	0.999690	0.907097	0.999897	0.000000e+00	1.123916e-07	0.006272	0.000098
7	0.010001	0.067340	0.896202	0.999759	4	distance	{'weights': 'distance', 'n_neighbors': 4}	5	0.894856	0.999690	0.889969	0.999690	0.903786	0.999897	0.000000e+00	9.429093e-04	0.005719	0.000098
9	0.010334	0.100343	0.889724	0.999759	5	distance	{'weights': 'distance', 'n_neighbors': 5}	6	0.889899	0.999690	0.882937	0.999690	0.896338	0.999897	4.714266e-04	3.301170e-02	0.005471	0.000098
11	0.010001	0.142681	0.886622	0.999759	6	distance	{'weights': 'distance', 'n_neighbors': 6}	7	0.885354	0.999690	0.879421	0.999690	0.895096	0.999897	1.123916e-07	4.084800e-02	0.006460	0.000098
2	0.010668	0.043004	0.884830	0.951582	2	uniform	{'weights': 'uniform', 'n_neighbors': 2}	8	0.882049	0.952731	0.886660	0.950589	0.885785	0.951426	4.713142e-04	8.164374e-04	0.002000	0.000881
13	0.011334	0.155015	0.881591	0.999759	7	distance	{'weights': 'distance', 'n_neighbors': 7}	9	0.879777	0.999690	0.874457	0.999690	0.890544	0.999897	1.885762e-03	3.001412e-02	0.006690	0.000098
4	0.010334	0.056006	0.880695	0.944690	3	uniform	{'weights': 'uniform', 'n_neighbors': 3}	10	0.879364	0.948283	0.875284	0.942526	0.887441	0.943262	4.714266e-04	8.165347e-04	0.005050	0.002558